You can work 100% in VS Code (for both R and Python); there is no need to switch between VS Code and RStudio.
You can work through your cells one at a time and see the incremental progress, similar to using .ipynb with Python or .Rmd in RStudio.
With Quarto’s convert command, you can reformat and jump between the different file formats. For example,
quarto convert HW-2.rmd will convert the file to HW-2.ipynb.
quarto convert HW-2.ipynb will convert the file to HW-2.qmd, which can be renamed HW-2.rmd or opened in RStudio like any other Rmd file.
quarto render HW-2.ipynb will render the notebook (either R or Python) into an aesthetically pleasing output.
Import
Code
#knitr::opts_chunk$set(include = FALSE) # for making prompts
knitr::opts_chunk$set(echo = TRUE) # for making solutions
knitr::opts_chunk$set(fig.width = 3)
library(knitr)
library(tidyverse)
library(modeldata)
library(leaps)
library(caret)
library(corrplot)
library(MASS)
library(ISLR)
library(glmnet)
library(gam)
library(ggplot2)
library(dplyr)
HW-3.1: Stock returns
Use polynomials and ridge regression to predict stock returns
This problem uses the built-in EuStockMarkets dataset, which contains time series of daily closing prices of major European stock indices from 1991 to 1998. We use only the FTSE column in this problem. The dataset is a time series object, so you will need to extract the FTSE column and convert it into a data frame.
HW-3.1a:
Fit polynomial models of degrees 4, 8, 12 to the FTSE data and plot all three fitted curves together with a scatterplot of the data. Comment on the plots. Which features in the data are resolved by the polynomial models? Which features are not resolved? Do the polynomial curves show any artifacts such as oscillations?
Code
# GET DATA
data("EuStockMarkets")
print(head(EuStockMarkets))
# INSERT CODE HERE
# Load my theme
my_theme <- readRDS('~/.Rthemes/my_theme.rds')

# Extract the FTSE column from the time series
ftse <- EuStockMarkets[, "FTSE"]

# Make ftse a df
ftse_df <- data.frame(date = seq(as.Date("1991-01-01"), by = "day", length.out = length(ftse)),
                      FTSE = as.vector(ftse))

# Models
fit4  <- lm(FTSE ~ poly(date, 4,  raw = FALSE), data = ftse_df)
fit8  <- lm(FTSE ~ poly(date, 8,  raw = FALSE), data = ftse_df)
fit12 <- lm(FTSE ~ poly(date, 12, raw = FALSE), data = ftse_df)

# Predictions
pred4  <- predict(fit4,  newdata = ftse_df)
pred8  <- predict(fit8,  newdata = ftse_df)
pred12 <- predict(fit12, newdata = ftse_df)
Code
# Plot
ggplot(ftse_df, aes(x = date)) +
  geom_point(aes(y = FTSE)) +
  geom_line(aes(y = pred4,  color = "Degree 4"),  linewidth = 1.5) +
  geom_line(aes(y = pred8,  color = "Degree 8"),  linewidth = 1.5) +
  geom_line(aes(y = pred12, color = "Degree 12"), linewidth = 1.5) +
  scale_color_manual(name = "Degree of Poly",
                     values = c("Degree 4" = "pink", "Degree 8" = "royalblue1",
                                "Degree 12" = "darkolivegreen1")) +
  labs(title = "FTSE over time & Fitted Polynomial Curves", x = "Date", y = "FTSE") +
  my_theme() +
  theme(panel.grid = element_line(color = "grey"))
The degree-12 fit tracks the data best, followed by degree 8 and then degree 4: the higher the degree, the more of the long-run pattern the curve can follow. All three capture the overall trend, but none resolves the short-term ups and downs (likely due to seasons or months of the year). They also show oscillations, increasingly so at higher degrees, yet still cannot capture the complexity that time adds to the data. Some of these oscillations are artifacts not present in the data: the degree-8 and degree-12 curves in particular drop sharply at the beginning and end of the series, a sign of overfitting.
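The boundary instability noted above can be checked numerically: for polynomial regression, the pointwise standard error of the fitted curve grows toward the edges of the observed range. A minimal, self-contained sketch on synthetic data (base R only; the sine trend and all variable names are invented for illustration, not taken from the FTSE data):

```r
set.seed(1)

# Synthetic series: smooth trend plus noise
x <- seq(0, 1, length.out = 200)
y <- sin(6 * x) + rnorm(200, sd = 0.2)
d <- data.frame(x = x, y = y)

# Degree-12 polynomial fit, as in the FTSE example
fit12 <- lm(y ~ poly(x, 12), data = d)

# Pointwise standard errors of the fitted curve
se <- predict(fit12, newdata = d, se.fit = TRUE)$se.fit

# The fit is least stable at the boundaries of the x-range
cat("se at boundary:", round(se[1], 3), "| se at centre:", round(se[100], 3), "\n")
```

The boundary standard error comes out noticeably larger than the interior one, which is the same instability that makes high-degree fits dive at the ends of a series.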
HW-3.1.b:
Use ridge regression to regularize the polynomial model of degree 12. Use \(\lambda_{1SE}\). Plot the resulting polynomial model onto the data and comment on it.
Code
# Make model matrix (degree 12)
features <- model.matrix(~ poly(as.numeric(ftse_df$date), 12, raw = FALSE), data = ftse_df)[, -1]

set.seed(444)

# Cross-validation
cv <- cv.glmnet(features, ftse_df$FTSE, alpha = 0)

# lambda.1se (optimal lambda under the one-standard-error rule)
lambda_1SE <- cv$lambda.1se

# Model & predictions
ridge_model <- glmnet(features, ftse_df$FTSE, alpha = 0)
ridge_predictions <- predict(ridge_model, s = lambda_1SE, newx = features)

# Plot
plot(ftse_df$date, ftse_df$FTSE, type = "l", xlab = "Date", ylab = "FTSE",
     main = "Ridge Regression on Degree 12 Polynomial")
lines(ftse_df$date, ridge_predictions, col = "red")
legend("bottomright", legend = c("Actual", "Ridge Prediction"),
       col = c("black", "red"), lwd = 2, bty = "n")
Ridge regression has cut down some of the overfitting, but the fitted curve still drops dramatically at both ends of the series, which could be a problem when predicting future data. Penalizing the magnitude of the coefficients shrinks the oscillations but does not fully remove these boundary artifacts.
HW-3.2: Advertising budgets
Improve advertising budgets using GAMs
Use the Advertising dataset, which can be found here. Split the data into a training and test set (70% / 30%).
Code
# GET DATA
set.seed(441)
ads <- read_csv('https://www.statlearning.com/s/Advertising.csv')
ads <- ads[, -1]

# Splitting data
trainIndex <- sample(seq_len(nrow(ads)), size = floor(0.7 * nrow(ads)))
train_data <- ads[trainIndex, ]
test_data  <- ads[-trainIndex, ]

# For our checking
paste("rows of training data:", nrow(train_data), "rows of testing data:", nrow(test_data))
[1] "rows of training data: 140 rows of testing data: 60"
HW-3.2a:
Fit generalized additive models to predict sales, using smoothing splines of degrees 2, 3, 4, 5, 6 for the three predictors. How do the rms prediction errors compare to the rms prediction error of a multiple regression model on the training set? On the test set?
Code
# INSERT CODE HERE
# Linear model
lm_model <- lm(sales ~ TV + radio + newspaper, data = train_data)

# Predictions on training and test sets
lm_predictions_train <- predict(lm_model, newdata = train_data)
lm_predictions_test  <- predict(lm_model, newdata = test_data)

# RMS errors
lm_rms_train <- sqrt(mean((lm_predictions_train - train_data$sales)^2))
lm_rms_test  <- sqrt(mean((lm_predictions_test - test_data$sales)^2))
print(paste("Linear Model RMS on Training Set:", lm_rms_train))
[1] "Linear Model RMS on Training Set: 1.58165007636303"
Code
print(paste("Linear Model RMS on Test Set:", lm_rms_test))
[1] "Linear Model RMS on Test Set: 1.93558979182608"
Code
# INSERT CODE HERE
# Set up the degrees, predictors, and storage for models and names
degrees <- 2:6
predictors <- c("TV", "radio", "newspaper")
gam_mods <- list()
mod_names <- vector()

# Loop through all degrees
for (degree in degrees) {
  # Formula
  gam_form <- as.formula(paste("sales ~", paste("ns(", predictors, ", df =", degree, ")", collapse = " + ")))
  # Fit model and store
  gam_mod_name <- paste("GAM_Degree", degree, sep = "_")
  gam_mods[[gam_mod_name]] <- gam(gam_form, data = train_data)
  mod_names <- c(mod_names, gam_mod_name)
}

# RMSE function
calculateRMSE <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Storage for RMSE
rmse_train <- setNames(numeric(length(mod_names)), mod_names)
rmse_test  <- setNames(numeric(length(mod_names)), mod_names)

# RMSE for each GAM model (training)
for (mod_name in names(gam_mods)) {
  predictions1 <- predict(gam_mods[[mod_name]], newdata = train_data, type = "response")
  rmse_train[mod_name] <- calculateRMSE(train_data$sales, predictions1)
}

# RMSE for each GAM model (testing)
for (mod_name in names(gam_mods)) {
  predictions2 <- predict(gam_mods[[mod_name]], newdata = test_data, type = "response")
  rmse_test[mod_name] <- calculateRMSE(test_data$sales, predictions2)
}

# RMSE values
print("RMSE for each model on the train set:")
On the training set, all the GAM errors are lower than the multiple regression error, and they decrease further as the degree increases, which is expected. On the test set, however, the degree-2 model has an error at or slightly above that of the multiple regression model, while the other degrees come in lower. The degree-3 model does best; beyond that, the error slowly increases with degree.
HW-3.2.b:
Is there evidence of overfitting?
INSERT EXPLANATION HERE
There is no clear sign of overfitting, since the GAM models have lower test errors than the multiple regression model. That said, degree 3 does best, suggesting the higher degrees overfit slightly relative to it, though the differences are small.
HW-3.2.c:
You now have six models (five GAM and one LM). Which model should be used? Explain your answer.
INSERT EXPLANATION HERE
The degree-3 GAM has the lowest test RMSE and, after degree 2, the lowest degree. Both suggest it will generalize best to new data while remaining one of the simplest models, which is also desirable. Additionally, it shows no signs of overfitting.
HW-3.3: Boston housing
Use LASSO to predict housing prices in Boston
Consider the Boston data from the MASS package. We want to use LASSO to predict the median home value medv using all the other predictors.
HW-3.3.a:
Set up the LASSO and plot the trajectories of all coefficients. What are the last five variables to remain in the model?
Note: This was lab 3.1 question 2.
Code
# INSERT CODE HERE
data("Boston")

# Data as matrix
x <- as.matrix(Boston[, -which(names(Boston) == "medv")])
y <- Boston$medv

set.seed(444)

# Split data and set as matrix
train_indices <- sample(1:nrow(Boston), size = 0.8 * nrow(Boston))
train_data <- Boston[train_indices, ]
test_data  <- Boston[-train_indices, ]
x_train <- as.matrix(train_data[, -which(names(train_data) == "medv")])
y_train <- train_data$medv
x_test  <- as.matrix(test_data[, -which(names(test_data) == "medv")])
y_test  <- test_data$medv
Code
# INSERT CODE HERE
# Model
lasso_model <- glmnet(x_train, y_train, alpha = 1)

# Plot
plot(lasso_model, xvar = "lambda", label = TRUE)
title("Trajectories of coefficients", line = 3)
Code
# Get the lambdas
lambda_seq <- exp(seq(log(max(lasso_model$lambda)), log(min(lasso_model$lambda)), length.out = 100))

# To save the variables
last_five_vars <- character()

# For all lambdas
for (lambda in lambda_seq) {
  # Get the coefficients at this lambda
  lasso_coef <- predict(lasso_model, type = "coefficients", s = lambda)
  # Make it numeric
  coef_vector <- as.numeric(lasso_coef[-1, , drop = FALSE])
  # Keep only nonzero coefficients
  non_zero_coefs <- which(coef_vector != 0)
  if (length(non_zero_coefs) == 5) {
    # Get the last five, without the intercept
    last_five_vars <- rownames(lasso_coef)[-1][non_zero_coefs]
    break
  }
}
cat("Last five variables to remain:", paste(last_five_vars, collapse = ", "))
Last five variables to remain: chas, rm, ptratio, black, lstat
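As a side note, the manual lambda loop can be avoided: a glmnet fit already records the number of nonzero coefficients at each lambda in its `df` component. A hedged sketch on simulated data (the variable names `v1`…`v10` are invented for illustration; the path can skip a given model size, hence the guard):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(200 * 10), 200, 10)
colnames(x) <- paste0("v", 1:10)
y <- 3 * x[, 1] + 2 * x[, 2] + rnorm(200)

fit <- glmnet(x, y, alpha = 1)

# fit$df[k] is the number of nonzero coefficients at fit$lambda[k]
k <- which(fit$df == 5)[1]
if (!is.na(k)) {
  beta <- coef(fit)[-1, k]   # coefficients at that lambda, intercept dropped
  cat("Survivors at five variables:", names(beta)[beta != 0], "\n")
}
```

Reading the path directly also avoids re-predicting coefficients at 100 interpolated lambdas.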
HW-3.3.b:
Find the 1SE value of \(\lambda\), using 10-fold cross-validation. What is the cross validation estimate for the residual standard error?
Code
# INSERT CODE HERE
set.seed(444)

# Model
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)

# Best lambda
lambda_min <- cv_lasso$lambda.min

# 1SE
lambda_1se <- cv_lasso$lambda.1se

# Get the index
index_1se <- which.min(abs(cv_lasso$lambda - lambda_1se))

# Select for that lambda
cvm_1se <- sqrt(cv_lasso$cvm[index_1se])

cat("Lambda 1-SE:", lambda_1se, "\n")
Lambda 1-SE: 0.3625409
Code
cat("Cross-validation estimate for Residual Standard Error:", cvm_1se)
Cross-validation estimate for Residual Standard Error: 4.924468
HW-3.3.c:
Rescale all predictors so that their mean is zero and their standard deviation is 1. Then set up the LASSO and plot the trajectories of all coefficients. What are the last five variables to remain in the model? Compare your answer to part a.
Code
# INSERT CODE HERE
# Scale
x_train_scaled <- scale(x_train)

set.seed(444)

# Model
lasso_model <- glmnet(x_train_scaled, y_train, alpha = 1)

# Plot
plot(lasso_model, xvar = "lambda", label = TRUE)
title("Trajectories of coefficients", line = 3)
Code
# INSERT CODE HERE
lambda_seq <- exp(seq(log(max(lasso_model$lambda)), log(min(lasso_model$lambda)), length.out = 100))
last_five_vars <- character()

for (lambda in lambda_seq) {
  lasso_coef <- predict(lasso_model, type = "coefficients", s = lambda)
  coef_vector <- as.numeric(lasso_coef[-1, , drop = FALSE])
  non_zero_coefs <- which(coef_vector != 0)
  if (length(non_zero_coefs) == 5) {
    last_five_vars <- rownames(lasso_coef)[-1][non_zero_coefs]
    break
  }
}
cat("Last five variables to remain:", paste(last_five_vars, collapse = ", "))
Last five variables to remain: chas, rm, ptratio, black, lstat
The last five variables are the same. Rescaling does change their trajectories, as seen in the plot, and makes them easier to compare visually.
HW-3.3.d:
Find the 1SE value of \(\lambda\) using 10-fold cross-validation. What is the cross validation estimate for the residual standard error now? Does rescaling lead to a better performing model?
Code
# INSERT CODE HERE
set.seed(444)
lasso.out <- cv.glmnet(x_train_scaled, y_train, alpha = 1, nfolds = 10)

# Extract the lambda value one standard error above the minimum
lambda_1se <- lasso.out$lambda.1se

# Calculate the cross-validated RSE
rse <- sqrt(lasso.out$cvm[lasso.out$lambda == lambda_1se])

# Print lambda value and RSE
cat("1SE Lambda:", lambda_1se, "\nRSE:", rse)
1SE Lambda: 0.3625409
RSE: 4.924468
Both the error and the lambda are the same as before, so rescaling does not lead to a better performing model here.
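This is expected rather than coincidental: `cv.glmnet` standardizes the predictors internally by default (`standardize = TRUE`), and standardizing a column is unaffected by any prior rescaling of that column, so pre-scaling `x` reproduces the exact same path. A sketch on simulated data with fixed folds (all names and the mixed column scales are illustrative):

```r
library(glmnet)

set.seed(1)
n <- 100
# Predictors on wildly different scales
x <- matrix(rnorm(n * 5), n, 5) %*% diag(c(1, 10, 100, 0.1, 5))
y <- as.numeric(x %*% c(1, 0.1, 0.01, 2, 0) + rnorm(n))

# Fix the folds so both runs use identical CV splits
foldid <- sample(rep(1:10, length.out = n))

cv_raw    <- cv.glmnet(x,        y, alpha = 1, foldid = foldid)
cv_scaled <- cv.glmnet(scale(x), y, alpha = 1, foldid = foldid)

cat("lambda.1se raw:   ", cv_raw$lambda.1se, "\n")
cat("lambda.1se scaled:", cv_scaled$lambda.1se, "\n")
```

Both runs report the same lambda sequence and CV error curve, matching what happened with the Boston data above.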
HW-3.4: Bike share usage
Predict bike share usage in Seoul using ridge and LASSO regressions
Access the dataset here. Filter the data to only include rows with "Functioning Day" == 'Yes'. Next drop the columns Date, Hour, Seasons, Holiday, and Functioning Day. Then drop any rows that have missing values in any column. Hint: You will need to rename some of the variables because their names include non-ASCII characters. This will help you later on.
Code
# GET DATA
# Obtaining data and column names
bike <- read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv',
                 locale = locale(encoding = "UTF-8"), show_col_types = FALSE)
colnames(bike)
# Filtering only Functioning Day == "Yes"
bike <- bike %>% filter(`Functioning Day` == "Yes")

# Manually changing the names (other approaches were not supported due to the encoding)
names(bike) <- c("Date", "Rented_Bike_Count", "Hour", "Temperature_C", "Humidity",
                 "Wind_speed_ms", "Visibility_10m", "Dew_point_temperature_C",
                 "Solar_Radiation_MJ_m2", "Rainfall_mm", "Snowfall_cm",
                 "Seasons", "Holiday", "Functioning_Day")

# Drop unwanted columns
cols_to_drop <- c("Date", "Hour", "Seasons", "Holiday", "Functioning_Day")
bike <- bike %>% dplyr::select(-one_of(cols_to_drop))

# Drop NAs
bike <- na.omit(bike)
head(bike)
HW-3.4.a:
Run a linear regression to predict rented bike count using the remaining 8 variables in the dataset. Report the MSE and the most influential variables.
Code
# Linear regression model
lm_model <- lm(Rented_Bike_Count ~ ., data = bike)

# Extract MSE
mse <- mean(lm_model$residuals^2)

# Get variable coefficients
coefs <- coef(lm_model)[-1]  # Exclude intercept

# Identify the 5 most influential variables
most_influential <- names(sort(abs(coefs), decreasing = TRUE)[1:5])

# Print MSE and vars
print(paste("MSE:", mse))
[1] "MSE: 234111.046434225"
Code
cat("The 5 most influential variables:", most_influential)
The 5 most influential variables: Solar_Radiation_MJ_m2 Wind_speed_ms Rainfall_mm Snowfall_cm Temperature_C
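One caveat on ranking by raw coefficient size: the predictors are in different units (degrees, mm, m/s, metres), so \(|\beta_j|\) alone is not comparable across them. A common alternative is to rank by \(|\beta_j| \cdot \mathrm{sd}(x_j)\), the effect of a one-standard-deviation change in each predictor. A hedged sketch on the built-in `mtcars` data (the same recipe would apply to the bike data):

```r
# Rank predictors by the effect of a one-SD change, not by raw coefficient size
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)

betas <- coef(fit)[-1]                     # drop intercept
sds   <- sapply(mtcars[names(betas)], sd)  # per-predictor standard deviations

ranked <- sort(abs(betas * sds), decreasing = TRUE)
print(names(ranked))
```

On this scale, `wt` dominates even though its raw coefficient is dwarfed by none of the unit quirks that make `disp` look negligible.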
HW-3.4.b:
Fit a ridge regression model with the optimal \(\lambda\) chosen by cross validation. Report the CV MSE.
Code
# Data as matrix
x <- as.matrix(subset(bike, select = -c(Rented_Bike_Count)))
y <- bike$Rented_Bike_Count

# Fit ridge with CV
ridge_mod <- cv.glmnet(x, y, alpha = 0)

# Optimal lambda
opt_lambda <- ridge_mod$lambda.min

# Fit ridge model with optimal lambda
ridge_fit <- glmnet(x, y, alpha = 0, lambda = opt_lambda)
ridge_pred <- predict(ridge_fit, s = opt_lambda, newx = x)

# MSE
cv_mse <- min(ridge_mod$cvm)

# Print
cat("CV MSE:", cv_mse)
CV MSE: 236017
HW-3.4.c:
Perform the same fit using a LASSO regression this time. Choose the optimal \(\lambda\) using cross validation. Report on the remaining variables in the model and the CV MSE. How does this performance compare to ridge and a plain linear model?
Code
# INSERT CODE HERE
# Fit LASSO
cv_lasso <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")

# Optimal lambda
optimal_lambda_lasso <- cv_lasso$lambda.min
cat("Optimal lambda for LASSO:", optimal_lambda_lasso, "\n")
Optimal lambda for LASSO: 0.9379857
Code
# MSE
cv_mse_lasso <- min(cv_lasso$cvm)
paste("CV MSE for LASSO with optimal lambda:", cv_mse_lasso)
[1] "CV MSE for LASSO with optimal lambda: 234598.796445585"
Code
# Model with optimal lambda
lasso_model <- glmnet(x, y, alpha = 1, lambda = optimal_lambda_lasso)

# Coefficients
coef_lasso <- coef(lasso_model)

# Remaining vars
remaining_vars <- rownames(coef_lasso)[coef_lasso[, 1] != 0]

# Print remaining vars
cat("Remaining variables in the LASSO model:\n", remaining_vars)
Remaining variables in the LASSO model:
(Intercept) Temperature_C Humidity Wind_speed_ms Visibility_10m Solar_Radiation_MJ_m2 Rainfall_mm Snowfall_cm
Code
print("Coefficients:")
[1] "Coefficients:"
Code
print(lasso_coefs <- coef(cv_lasso, s = optimal_lambda_lasso, exact = TRUE))
9 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 8.523407e+02
Temperature_C 3.634304e+01
Humidity -1.052416e+01
Wind_speed_ms 5.231777e+01
Visibility_10m 3.064328e-03
Dew_point_temperature_C .
Solar_Radiation_MJ_m2 -1.141101e+02
Rainfall_mm -5.273742e+01
Snowfall_cm 3.350817e+01
This model has a lower CV MSE than ridge but slightly higher than the plain linear model's MSE.
HW-3.4.e:
Interpretation and communication. Write a short paragraph about your analysis and recommendations, explaining the most important factors for high bike share usage, why you came to that conclusion, and what actions can be taken by a bike rental company based on this information.
INSERT EXPLANATION HERE
Across all the models run, the MSE is lowest for the plain linear model, followed by the LASSO model and then ridge, which suggests we would probably use the plain linear model, although the values are all very similar. The variables retained by LASSO are Temperature, Humidity, Wind Speed, Visibility, Solar Radiation, Rainfall, and Snowfall, which gives insight into the most important predictors of rented bike count. Humidity, solar radiation, and rainfall affect rentals negatively, while the remaining variables affect them positively. This is useful information for a bike rental company when making decisions such as where to open new rental locations, how many people to hire in different seasons, how to prepare for different weather forecasts, and where to place marketing campaigns strategically.
HW-3.5: Splines
Compare the characteristics of two different smoothing splines
Consider two curves \(\hat{g}_1\) and \(\hat{g}_2\) defined as follows:
\[
\hat{g}_1 = \arg\min_g \left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(3)}(x)]^2 \, dx \right)
\]
\[
\hat{g}_2 = \arg\min_g \left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(4)}(x)]^2 \, dx \right)
\]
where \(g^{(m)}\) represents the \(m^{th}\) derivative of \(g\).
HW-3.5.a:
As \(\lambda \to \infty\), which function (\(\hat{g_1}\) or \(\hat{g_2}\)) will have the smaller training RSS?
INSERT EXPLANATION HERE
As \(\lambda \to \infty\), the penalty term dominates, forcing the penalized derivative to zero: the third derivative for \(\hat{g}_1\) and the fourth for \(\hat{g}_2\). This makes a difference, because it drives \(\hat{g}_1\) toward a polynomial of degree 2 (quadratic) and \(\hat{g}_2\) toward a polynomial of degree 3 (cubic). The cubic has an extra degree of freedom, so \(\hat{g}_2\) can fit the training data at least as well and will generally have the smaller training RSS. That said, both limits are extremely smooth, so the difference depends on the shape of the data and will likely be small.
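The limiting forms can be written out: a nonzero penalty integral would cost infinitely much as \(\lambda \to \infty\), so each minimizer must make its own penalty vanish,
\[
\int [\hat{g}_1^{(3)}(x)]^2 \, dx = 0 \;\Rightarrow\; \hat{g}_1(x) = a_0 + a_1 x + a_2 x^2, \qquad
\int [\hat{g}_2^{(4)}(x)]^2 \, dx = 0 \;\Rightarrow\; \hat{g}_2(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3,
\]
i.e. each becomes the least-squares polynomial of its degree. Since every quadratic is also a cubic, the cubic family's minimum training RSS is less than or equal to the quadratic's.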
HW-3.5.b:
As \(\lambda \to \infty\), which function (\(\hat{g_1}\) or \(\hat{g_2}\)) will have the smaller test RSS?
INSERT EXPLANATION HERE
From part a we know what \(\hat{g}_1\) and \(\hat{g}_2\) tend to. As \(\lambda \to \infty\), both become extremely smooth and will likely underfit. Whether the simpler quadratic or the cubic generalizes better depends on the true shape of the data, so either could have the smaller test RSS; given how smooth both limits are, the expected difference is very small.
HW-3.5.c:
For \(\lambda = 0\), which function (\(\hat{g_1}\) or \(\hat{g_2}\)) will have the smaller training and test RSS?
INSERT EXPLANATION HERE
When \(\lambda = 0\), the entire penalty term vanishes, since the integral is multiplied by zero, allowing the splines to fit the data as closely as possible. The remaining term, the sum of squared residuals, is identical for \(\hat{g}_1\) and \(\hat{g}_2\), so the two problems are the same: both will produce the same fit and hence the same training and test RSS.
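In symbols, with \(\lambda = 0\) both criteria reduce to the same unpenalized problem,
\[
\hat{g}_1 = \hat{g}_2 = \arg\min_g \sum_{i=1}^n (y_i - g(x_i))^2,
\]
and with distinct \(x_i\) and no smoothness constraint, any function interpolating the points attains the minimum training RSS of zero. The two problems being literally identical is what guarantees equal training and test RSS.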
HW-3.6:
Explain the behavior of the curve for a variety of \(\lambda\) and \(m\) values.
Suppose a curve \(\hat{g}\) is fit smoothly to a set of \(n\) points as follows:
\[
\hat{g} = \arg\min_g \left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(m)}(x)]^2 \, dx \right)
\]
where \(g^{(m)}\) is the \(m\)th derivative of \(\hat{g}\) and \(g^{(0)} = g\). Provide plots of \(\hat{g}\) in each of the following scenarios along with the original points provided.
Use the following starter code to make your set of points and plot your various model predictions.
HW-3.6.a:
\(\lambda = \infty, m = 0\).
Code
# INSERT SOLUTION HERE
# Large value approximating infinity (the largest that would run)
lambda <- 1e8

# m
m <- 0

# Fit smoothing spline to data
fit <- smooth.spline(df$X, df$Y, lambda = lambda)

# Predict using fitted smoothing spline with derivative order m
preds <- predict(fit, df$X, deriv = m)

# Predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) +
  geom_point(alpha = 0.5) +
  stat_function(fun = generating_fn, aes(col = "Generating Function")) +
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) +
  theme(legend.position = "right", legend.title = element_blank())
HW-3.6.b:
\(\lambda = \infty, m = 1\).
Code
# INSERT SOLUTION HERE
# Large value approximating infinity (the largest that would run)
lambda <- 1e8

# m
m <- 1

# Fit smoothing spline to data
fit <- smooth.spline(df$X, df$Y, df = 100, lambda = lambda)

# Predict using fitted smoothing spline with derivative order m
preds <- predict(fit, df$X, deriv = m)

# Predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) +
  geom_point(alpha = 0.5) +   # plot original points
  stat_function(fun = generating_fn, aes(col = "Generating Function")) +
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) +
  theme(legend.position = "right", legend.title = element_blank())
HW-3.6.c:
\(\lambda = \infty, m = 2\).
Code
# INSERT SOLUTION HERE
# Large value approximating infinity (the largest that would run)
lambda <- 1e8

# m
m <- 2

# Fit smoothing spline to data
fit <- smooth.spline(df$X, df$Y, df = 100, lambda = lambda)

# Predict using fitted smoothing spline with derivative order m
preds <- predict(fit, df$X, deriv = m)

# Predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) +
  geom_point(alpha = 0.5) +
  stat_function(fun = generating_fn, aes(col = "Generating Function")) +
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) +
  theme(legend.position = "right", legend.title = element_blank())
HW-3.6.d:
\(\lambda = \infty, m = 3\).
Code
# INSERT SOLUTION HERE
# Large value approximating infinity (the largest that would run)
lambda <- 1e8

# m
m <- 3

# Fit smoothing spline to data
fit <- smooth.spline(df$X, df$Y, df = 100, lambda = lambda)

# Predict using fitted smoothing spline with derivative order m
preds <- predict(fit, df$X, deriv = m)

# Predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) +
  geom_point(alpha = 0.5) +
  stat_function(fun = generating_fn, aes(col = "Generating Function")) +
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) +
  theme(legend.position = "right", legend.title = element_blank())
HW-3.6.e:
\(\lambda = 0, m = 3\).
Code
# INSERT SOLUTION HERE
# lambda = 0
lambda <- 0

# m
m <- 3

# Fit smoothing spline to data
fit <- smooth.spline(df$X, df$Y, lambda = lambda)

# Predict using fitted smoothing spline
preds <- predict(fit, df$X)

# Predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) +
  geom_point(alpha = 0.5) +
  stat_function(fun = generating_fn, aes(col = "Generating Function")) +
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) +
  theme(legend.position = "right", legend.title = element_blank())
HW-3.6.f:
Fit a smoothing spline on the dataset and report the optimal lambda
Code
# INSERT SOLUTION HERE
# Optimal fit spline
optimal_fit <- smooth.spline(df$X, df$Y)

# Optimal lambda chosen by the fit
print(paste("Optimal lambda:", optimal_fit$lambda))
[1] "Optimal lambda: 0.000111660415400747"
Code
# Predict using fitted smoothing spline
preds <- predict(optimal_fit, df$X)

# Predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) +
  geom_point(alpha = 0.5) +
  stat_function(fun = generating_fn, aes(col = "Generating Function")) +
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) +
  theme(legend.position = "right", legend.title = element_blank())